Chronic Diseases Data Science Project

Decision Tree

Code
import sklearn
from sklearn import datasets
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import random
from collections import Counter
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support

Introductions

Decision Trees:

A decision tree is a tree-like structure in which each internal node tests a feature, each branch represents a decision rule (the outcome of that test), and each leaf node represents a predicted outcome. The top node is known as the root node. The tree learns to partition the data on the basis of feature values, splitting each resulting subset again in a process called recursive partitioning. This structure makes the resulting decisions easy to follow.
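To make recursive partitioning concrete, here is a minimal sketch of how a single split could be chosen on one feature using Gini impurity. The helper names (`gini`, `best_threshold`) and the toy data are made up for illustration; scikit-learn's own implementation is more elaborate.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(feature, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature."""
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    values = np.unique(feature)
    # Candidate thresholds are midpoints between consecutive distinct values
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = labels[feature <= t], labels[feature > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data (hypothetical): ages and a binary outcome
age = np.array([35, 40, 45, 50, 55, 60, 65, 70])
outcome = np.array([0, 0, 0, 0, 1, 1, 1, 1])
threshold, impurity = best_threshold(age, outcome)
print(threshold, impurity)
```

A full tree would apply the same search to every feature, pick the best split overall, and then repeat the procedure on the two resulting subsets.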

Random Forest:

Random Forest is an ensemble learning method, where multiple decision trees are combined to increase the overall performance. While a single decision tree might lead to overfitting, Random Forest averages multiple trees to reduce overfitting and improve prediction accuracy. Furthermore, when splitting each node during the construction of a tree, the best split is found either from all input features or a random subset of size max_features.
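As a rough illustration of the idea, the sketch below fits a single tree and a random forest on synthetic data generated with `make_classification` (not the Framingham data). The sample sizes, `n_estimators=100` and `max_features="sqrt"` are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, used only for illustration
X_rf, y_rf = make_classification(n_samples=1000, n_features=20,
                                 n_informative=5, random_state=0)
X_rf_tr, X_rf_te, y_rf_tr, y_rf_te = train_test_split(
    X_rf, y_rf, test_size=0.2, random_state=0)

# One fully grown tree versus an ensemble of 100 trees,
# each split considering a random subset of sqrt(n_features) features
single_tree = DecisionTreeClassifier(random_state=0).fit(X_rf_tr, y_rf_tr)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X_rf_tr, y_rf_tr)

tree_acc = single_tree.score(X_rf_te, y_rf_te)
forest_acc = forest.score(X_rf_te, y_rf_te)
print("single tree test accuracy  :", tree_acc)
print("random forest test accuracy:", forest_acc)
```

Which model scores higher depends on the data; the point of the sketch is only to show how `max_features` enters the ensemble.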

Methods

For this part of the analysis, I apply decision trees to both classification and regression: classification for the binary target cardiovascular disease (CVD), and regression for the continuous numeric target GLUCOSE.

  • Classification
  • Regression

Data and Class Distribution

For this part of the analysis, I use the Framingham Heart Study dataset (@RProjectFramingham) for both the Decision Tree and Random Forest analyses. The target variable is "CVD", which indicates cardiovascular disease.

Code
# Load the Framingham Heart Study data set
data = pd.read_csv("data/frmgham2.csv")
data.head()

# Split target variables and predictor variables
y = data['CVD']
x = data.drop('CVD', axis=1)
x = x.iloc[:, 1:]
x = x.values
y = y.values


# Calculate the class distribution for 'CVD' (in percent)
cd = data['CVD'].value_counts(normalize=True) * 100
print(cd)
CVD
0    75.066655
1    24.933345
Name: proportion, dtype: float64

The target is binary: roughly 75% of the records are 0 (negative for CVD) and about 25% are 1 (positive for CVD). Because negative cases dominate, a model can look accurate simply by favoring the majority class, so the evaluation needs to look beyond overall accuracy.
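A small sketch of why accuracy alone can mislead here: a model that always predicts class 0 already reaches about 75% accuracy. The class counts below (8,728 negative and 2,899 positive out of 11,627 rows) are consistent with the percentages above; balanced accuracy is shown as one possible alternative metric, not as the method used in this analysis.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Class counts consistent with the 75.07% / 24.93% split of 11,627 rows
n_neg, n_pos = 8728, 2899
y_true_demo = np.concatenate([np.zeros(n_neg, dtype=int),
                              np.ones(n_pos, dtype=int)])

# "Model" that ignores the features and always predicts the majority class
y_majority = np.zeros_like(y_true_demo)

majority_acc = accuracy_score(y_true_demo, y_majority)
majority_bal_acc = balanced_accuracy_score(y_true_demo, y_majority)
print("accuracy         :", majority_acc)      # about 0.75
print("balanced accuracy:", majority_bal_acc)  # mean of per-class recall
```

Balanced accuracy averages the recall of each class, so the always-negative model scores only 0.5 on it even though its plain accuracy looks respectable.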

EDA

Code
data_eda = data.iloc[:, 1:]
print(data_eda.describe())
                SEX       TOTCHOL           AGE         SYSBP         DIABP  \
count  11627.000000  11218.000000  11627.000000  11627.000000  11627.000000   
mean       1.568074    241.162418     54.792810    136.324116     83.037757   
std        0.495366     45.368030      9.564299     22.798625     11.660144   
min        1.000000    107.000000     32.000000     83.500000     30.000000   
25%        1.000000    210.000000     48.000000    120.000000     75.000000   
50%        2.000000    238.000000     54.000000    132.000000     82.000000   
75%        2.000000    268.000000     62.000000    149.000000     90.000000   
max        2.000000    696.000000     81.000000    295.000000    150.000000   

           CURSMOKE       CIGPDAY           BMI      DIABETES        BPMEDS  \
count  11627.000000  11548.000000  11575.000000  11627.000000  11034.000000   
mean       0.432528      8.250346     25.877349      0.045584      0.085554   
std        0.495448     12.186888      4.102640      0.208589      0.279717   
min        0.000000      0.000000     14.430000      0.000000      0.000000   
25%        0.000000      0.000000     23.095000      0.000000      0.000000   
50%        0.000000      0.000000     25.480000      0.000000      0.000000   
75%        1.000000     20.000000     28.070000      0.000000      0.000000   
max        1.000000     90.000000     56.800000      1.000000      1.000000   

       ...           CVD      HYPERTEN        TIMEAP        TIMEMI  \
count  ...  11627.000000  11627.000000  11627.000000  11627.000000   
mean   ...      0.249333      0.743270   7241.556893   7593.846736   
std    ...      0.432646      0.436848   2477.780010   2136.730285   
min    ...      0.000000      0.000000      0.000000      0.000000   
25%    ...      0.000000      0.000000   6224.000000   7212.000000   
50%    ...      0.000000      1.000000   8766.000000   8766.000000   
75%    ...      0.000000      1.000000   8766.000000   8766.000000   
max    ...      1.000000      1.000000   8766.000000   8766.000000   

           TIMEMIFC       TIMECHD      TIMESTRK       TIMECVD       TIMEDTH  \
count  11627.000000  11627.000000  11627.000000  11627.000000  11627.000000   
mean    7543.036725   7008.153608   7660.880021   7166.082996   7854.102950   
std     2192.120311   2641.344513   2011.077091   2541.668477   1788.369623   
min        0.000000      0.000000      0.000000      0.000000     26.000000   
25%     7049.500000   5598.500000   7295.000000   6004.000000   7797.500000   
50%     8766.000000   8766.000000   8766.000000   8766.000000   8766.000000   
75%     8766.000000   8766.000000   8766.000000   8766.000000   8766.000000   
max     8766.000000   8766.000000   8766.000000   8766.000000   8766.000000   

            TIMEHYP  
count  11627.000000  
mean    3598.956395  
std     3464.164659  
min        0.000000  
25%        0.000000  
50%     2429.000000  
75%     7329.000000  
max     8766.000000  

[8 rows x 38 columns]

Correlation Matrix Heatmap

Code
corr = data.corr()
sns.set_theme(style="white")
f, ax = plt.subplots(figsize=(11, 9))  # Set up the matplotlib figure
cmap = sns.diverging_palette(230, 20, as_cmap=True)     # Generate a custom diverging colormap
# Draw the heatmap with the correct aspect ratio (no mask is applied)
sns.heatmap(corr,  cmap=cmap, vmin=-1, vmax=1, center=0,
        square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show();
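A heatmap with dozens of columns can be hard to read, so a numeric ranking of correlations with the target is sometimes a useful companion. The sketch below uses a small synthetic DataFrame; the column names `a`, `b` and `target` are hypothetical and stand in for the real predictors and CVD.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 500

# Synthetic example: 'a' drives the target, 'b' is unrelated noise
demo = pd.DataFrame({"a": rng.normal(size=n_rows),
                     "b": rng.normal(size=n_rows)})
demo["target"] = 3.0 * demo["a"] + rng.normal(scale=0.5, size=n_rows)

# Absolute correlation of every other column with the target, largest first
ranked = (demo.corr()["target"]
              .drop("target")
              .abs()
              .sort_values(ascending=False))
print(ranked)
```

Applied to the real data, the same pattern would replace `demo` with `data` and `"target"` with `"CVD"`.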

Baseline model for comparison

Code
## RANDOM CLASSIFIER 
def random_classifier(y_data):
    ypred=[];
    max_label=np.max(y_data); #print(max_label)
    for i in range(0,len(y_data)):
        ypred.append(int(np.floor((max_label+1)*np.random.uniform(0,1))))

    print("-----RANDOM CLASSIFIER-----")
    print("count of prediction:",Counter(ypred).values()) # counts the elements' frequency
    print("probability of prediction:",np.fromiter(Counter(ypred).values(), dtype=float)/len(y_data)) # counts the elements' frequency
    print("accuracy",accuracy_score(y_data, ypred))
    print("precision, recall, fscore,",precision_recall_fscore_support(y_data, ypred))

np.random.seed(42)    
random.seed(42)
random_classifier(y)
-----RANDOM CLASSIFIER-----
count of prediction: dict_values([5874, 5753])
probability of prediction: [0.50520341 0.49479659]
accuracy 0.5015051173991572
precision, recall, fscore, (array([0.7495744 , 0.24821832]), array([0.50446838, 0.49258365]), array([0.60306807, 0.33009709]), array([8728, 2899], dtype=int64))
  • Interpretation

Count of Prediction: This shows how many times each class label (0 and 1) was predicted. The values [5874, 5753] indicate that 5874 predictions were class ‘0’ and 5753 were class ‘1’.

Probability of Prediction: This represents the proportion of each class label in the predictions. The values [0.50520341 0.49479659] show that approximately 50% of predictions are class ‘0’, which is negative, and 50% are class ‘1’, which is positive. This near-even split confirms that the classifier assigns labels at random with almost equal probability.

Accuracy: The accuracy is approximately 50%. For a binary classifier that guesses each label with equal probability, this is exactly what is expected, so it serves as a reference point for the models that follow.

Precision: Precision for each class (0.7495744 for class ‘0’, 0.24821832 for class ‘1’) is the proportion of predictions of that class that are correct. Because the predictions are random, these values simply mirror the class distribution (about 75% and 25%); they do not reflect any learned skill.

Recall: Recall (0.50446838 for class ‘0’, 0.49258365 for class ‘1’) is the proportion of actual members of each class that are correctly identified. Both values are close to 50%, as expected when each label is guessed with equal probability.

F-Score: F-Score is the harmonic mean of precision and recall. The values (0.60306807 for class ‘0’, 0.33009709 for class ‘1’) indicate the balance between precision and recall for each class, with class ‘0’ having a better balance compared to class ‘1’.
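The reported metrics can be reproduced by hand from a confusion matrix. The four counts below are not printed by the notebook; they are back-calculated from the supports (8728, 2899), the prediction counts (5874, 5753) and the recalls in the output above, so treat them as derived values.

```python
# Confusion counts implied by the reported output (derived, not printed)
tn, fp = 4403, 4325   # actual class 0: predicted 0 / predicted 1
fn, tp = 1471, 1428   # actual class 1: predicted 0 / predicted 1

# Precision: correct predictions of a class / all predictions of that class
precision_0 = tn / (tn + fn)
precision_1 = tp / (tp + fp)

# Recall: correct predictions of a class / all actual members of that class
recall_0 = tn / (tn + fp)
recall_1 = tp / (tp + fn)

# F-score: harmonic mean of precision and recall
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

accuracy = (tn + tp) / (tn + fp + fn + tp)

print("precision:", precision_0, precision_1)
print("recall   :", recall_0, recall_1)
print("f-score  :", f1_0, f1_1)
print("accuracy :", accuracy)
```

Written this way, it is easy to see that the precision of a random guesser tracks the class proportions, while its recall tracks the probability of guessing each label.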

Model Tuning

Partition data into train/test split

Code
from sklearn.model_selection import train_test_split
test_ratio=0.2
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_ratio, random_state=0)
y_train=y_train.flatten()
y_test=y_test.flatten()

print("x_train.shape        :",x_train.shape)
print("y_train.shape        :",y_train.shape)

print("X_test.shape     :",x_test.shape)
print("y_test.shape     :",y_test.shape)
x_train.shape        : (9301, 37)
y_train.shape        : (9301,)
X_test.shape     : (2326, 37)
y_test.shape     : (2326,)
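With an imbalanced target, a stratified split is one way to keep the class proportions the same in both partitions. The sketch below uses synthetic labels with roughly 25% positives; passing `stratify` is a suggestion for illustration and is not what the split above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1,000 rows, 5 features, roughly 25% positives
X_strat = rng.normal(size=(1000, 5))
y_strat = (rng.uniform(size=1000) < 0.25).astype(int)

# stratify=y_strat preserves the class ratio in both partitions
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
    X_strat, y_strat, test_size=0.2, random_state=0, stratify=y_strat)

print("positive share, all  :", y_strat.mean())
print("positive share, train:", ys_tr.mean())
print("positive share, test :", ys_te.mean())
```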

Hyper-Parameter tuning for Decision Tree Classification (max_depth)

Code
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier


# HYPER PARAMETER SEARCH FOR OPTIMAL TREE DEPTH (max_depth)
hyper_param=[]
train_error=[]
test_error=[]

# LOOP OVER HYPER-PARAM
for i in range(1,40):
    # INITIALIZE MODEL 
    model = DecisionTreeClassifier(max_depth=i)

    # TRAIN MODEL 
    model.fit(x_train,y_train)

    # OUTPUT PREDICTIONS FOR TRAINING AND TEST SET 
    yp_train = model.predict(x_train)
    yp_test = model.predict(x_test)

    # shift=1+np.min(y_train) #add shift to remove division by zero 
    err1=mean_absolute_error(y_train, yp_train) 
    err2=mean_absolute_error(y_test, yp_test) 
    
    # err1=100.0*np.mean(np.absolute((yp_train-y_train)/y_train))
    # err2=100.0*np.mean(np.absolute((yp_test-y_test)/y_test))

    hyper_param.append(i)
    train_error.append(err1)
    test_error.append(err2)

    if(i==1 or i%10==0):
        print("hyperparam =",i)
        print(" train error:",err1)
        print(" test error:" ,err2)
hyperparam = 1
 train error: 0.15138157187399204
 test error: 0.15047291487532244
hyperparam = 10
 train error: 0.0
 test error: 0.0030094582975064487
hyperparam = 20
 train error: 0.0
 test error: 0.0021496130696474634
hyperparam = 30
 train error: 0.0
 test error: 0.0030094582975064487
  • For max_depth = 1, both the training and test errors are relatively high, at around 0.15. This suggests that the decision tree is underfitting the data; the tree needs to be deeper for a better fit.

  • As max_depth increases to 10, both the training and test errors drop sharply. At max_depth = 10 the training error is 0, meaning the model fits the training data perfectly, and the test error is also very low (about 0.003), so the model generalizes well to the held-out data.

  • As max_depth increases further to 20 and 30, the training error stays at 0 and the test error moves only between about 0.002 and 0.003, with no clear trend. Extra depth therefore brings no measurable benefit, and deeper trees are generally more prone to overfitting.

  • Overall, the results show that a max_depth of around 10 is a good choice: it achieves a low test error without adding unnecessary depth.

  • Convergence plot

Code
plt.plot(hyper_param,train_error ,linewidth=2, color='k')
plt.plot(hyper_param,test_error ,linewidth=2, color='b')

plt.xlabel("Depth of tree (max depth)")
plt.ylabel("Training (black) and test (blue) MAE (error)")

i=1
print(hyper_param[i],train_error[i],test_error[i])
2 0.07386302548113106 0.06835769561478934

  • The convergence plot shows that as max_depth increases, both the training and test errors stabilize, which agrees with the printed values in the previous section. Beyond a certain depth, the tree no longer improves its fit to the training data, and additional depth only makes the model more complex without improving generalization. The plot therefore supports choosing a max_depth that balances model complexity against performance on the test set.
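A single train/test split can make the choice of max_depth depend on how the rows happened to be divided. As a hedged alternative, the sketch below runs a 5-fold cross-validated search with `GridSearchCV` on synthetic data; the depth grid, fold count and data are illustrative assumptions, not settings taken from this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, used only for illustration
X_cv, y_cv = make_classification(n_samples=600, n_features=10,
                                 n_informative=4, random_state=0)

# Evaluate each candidate depth with 5-fold cross-validation
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 16))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_cv, y_cv)

best_depth = search.best_params_["max_depth"]
print("best max_depth        :", best_depth)
print("mean CV accuracy there:", search.best_score_)
```

Fixing `random_state` in the estimator also makes repeated runs comparable, which the loop above does not guarantee.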

Hyper-Parameter tuning for Decision Tree Classification (min_samples_split)

Code
# HYPER PARAMETER SEARCH FOR OPTIMAL min_samples_split
hyper_param=[]
train_error=[]
test_error=[]

# LOOP OVER HYPER-PARAM
for i in range(2,100):
    # INITIALIZE MODEL 
    model = DecisionTreeClassifier(min_samples_split=i)

    # TRAIN MODEL 
    model.fit(x_train,y_train)

    # OUTPUT PREDICTIONS FOR TRAINING AND TEST SET 
    yp_train = model.predict(x_train)
    yp_test = model.predict(x_test)

    # shift=1+np.min(y_train) #add shift to remove division by zero 
    err1=mean_absolute_error(y_train, yp_train) 
    err2=mean_absolute_error(y_test, yp_test) 
    
    # err1=100.0*np.mean(np.absolute((yp_train-y_train)/y_train))
    # err2=100.0*np.mean(np.absolute((yp_test-y_test)/y_test))

    hyper_param.append(i)
    train_error.append(err1)
    test_error.append(err2)

    if(i%10==0):
        print("hyperparam =",i)
        print(" train error:",err1)
        print(" test error:" ,err2)
hyperparam = 10
 train error: 0.0005375766046661649
 test error: 0.003869303525365434
hyperparam = 20
 train error: 0.0006450919255993979
 test error: 0.0030094582975064487
hyperparam = 30
 train error: 0.0008601225674658639
 test error: 0.0034393809114359416
hyperparam = 40
 train error: 0.0016127298139984947
 test error: 0.004299226139294927
hyperparam = 50
 train error: 0.0016127298139984947
 test error: 0.004299226139294927
hyperparam = 60
 train error: 0.0016127298139984947
 test error: 0.004299226139294927
hyperparam = 70
 train error: 0.0017202451349317277
 test error: 0.004299226139294927
hyperparam = 80
 train error: 0.0017202451349317277
 test error: 0.004299226139294927
hyperparam = 90
 train error: 0.0017202451349317277
 test error: 0.004299226139294927
  • The min_samples_split parameter controls the minimum number of samples a node must contain before it can be split during the construction of the decision tree.

  • As min_samples_split increases from 10 to 20, the training error rises slightly (from about 0.0005 to 0.0006) while the test error falls (from about 0.0039 to 0.0030). Requiring more samples per split makes the tree slightly less tailored to the training data.

  • Beyond min_samples_split = 20, the training error keeps increasing slowly but remains low, so the model still fits the training data well. The test error rises to about 0.0034 at 30 and then levels off at about 0.0043 from 40 onward.

  • Overall, the results suggest that a min_samples_split value between about 20 and 30 provides a good balance between model complexity and generalization.

Convergence Plot

Code
plt.plot(hyper_param,train_error ,linewidth=2, color='k')
plt.plot(hyper_param,test_error ,linewidth=2, color='b')

plt.xlabel("Minimum number of points in split (min_samples_split)")
plt.ylabel("Training (black) and test (blue) MAE (error)")
Text(0, 0.5, 'Training (black) and test (blue) MAE (error)')

The convergence plot highlights a critical observation: there exists a specific range of min_samples_split values where the model performs well on both the training and test datasets. However, as min_samples_split goes beyond this range, the model’s performance on the test data deteriorates noticeably.

In practical terms, this indicates that we should seek a min_samples_split value that does not result in a significant drop in test data performance. The objective is to find the sweet spot where the model maintains good generalization to unseen data without overfitting or underfitting.
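One way to see what min_samples_split does to the model itself is to look at tree size. The sketch below fits trees on noisy synthetic data and records the number of leaves and the depth for a few settings; the data and the values 2, 20 and 100 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (flip_y adds label noise), used only for illustration
X_mss, y_mss = make_classification(n_samples=1000, n_features=10,
                                   n_informative=4, flip_y=0.1,
                                   random_state=0)

leaf_counts = {}
for mss in (2, 20, 100):
    clf = DecisionTreeClassifier(min_samples_split=mss, random_state=0)
    clf.fit(X_mss, y_mss)
    leaf_counts[mss] = clf.get_n_leaves()
    print("min_samples_split =", mss,
          "| leaves:", clf.get_n_leaves(),
          "| depth:", clf.get_depth())
```

Larger values stop the tree from splitting small nodes, so the tree ends up with fewer leaves and is less able to memorize noise.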

Re-train with optimal parameters

Code
# INITIALIZE MODEL 
model = DecisionTreeClassifier(max_depth=10, min_samples_split=30)
model.fit(x_train,y_train)                     # TRAIN MODEL 


# OUTPUT PREDICTIONS FOR TRAINING AND TEST SET 
yp_train = model.predict(x_train)
yp_test = model.predict(x_test)

err1=mean_absolute_error(y_train, yp_train) 
err2=mean_absolute_error(y_test, yp_test) 
    
print(" train error:",err1)
print(" test error:" ,err2)
 train error: 0.0008601225674658639
 test error: 0.0034393809114359416

Parity Plot

  • Plotting y_pred vs y_data lets you see how good the fit is

  • The closer the points are to the line y=x, the better the fit (y_pred = y_data is a perfect fit)

Code
plt.plot(y_train,yp_train ,"o", color='k')
plt.plot(y_test,yp_test ,"o", color='b')
plt.plot(y_train,y_train ,"-", color='r')

plt.xlabel("y_data")
plt.ylabel("y_pred (blue=test)(black=Train)")
Text(0, 0.5, 'y_pred (blue=test)(black=Train)')

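  • Reading the plot: Because CVD is binary, every point falls on one of four coordinates. Points at (0, 0) and (1, 1) lie on the red line and are correct predictions; points at (0, 1) and (1, 0) are misclassifications.

  • Training Data Fit: Nearly all black dots lie on the red line, consistent with the training error of about 0.0009.

  • Test Data Fit: A small number of blue dots fall off the red line, consistent with the slightly higher test error of about 0.0034.

  • Overall Performance: The gap between training and test error is small, so the plot gives little sign of overfitting. Since many points overlap at the same coordinates, the error values are more informative here than the plot itself.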
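For a binary target, a parity plot collapses onto at most four (y_data, y_pred) coordinates, so counting how many points sit on each one can say more than the scatter itself. A minimal sketch with made-up labels and predictions:

```python
import numpy as np

# Made-up labels and predictions, purely for illustration
y_obs = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 1, 0, 0, 1, 0])

# Count how many points land on each (y_data, y_pred) coordinate
pairs, counts = np.unique(np.column_stack([y_obs, y_hat]),
                          axis=0, return_counts=True)
pair_counts = {tuple(int(v) for v in p): int(c)
               for p, c in zip(pairs, counts)}
print(pair_counts)
```

The keys (0, 0) and (1, 1) are the points on the line y=x; the other two keys are the misclassifications. This is the same information a confusion matrix reports.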

Plot Tree

Code
from sklearn import tree
def plot_tree(model):
    fig = plt.figure(figsize=(25,20))
    _ = tree.plot_tree(model, 
                    filled=True)
    plt.show()

plot_tree(model)

Conclusions (the final results are discussed within each section)

  • The decision tree classifier gave us insight into heart disease risk. It fit the training data very well (error of about 0.0009), and the error on the held-out test data was only slightly higher (about 0.0034). Errors this low deserve caution: the predictors still include outcome-related columns such as TIMECVD, which may reveal the target directly, so performance on genuinely new patients could be worse than these figures suggest.

  • To improve, we aim to adjust our model’s complexity. When the model is too simple, we miss important patterns; when it is too complex, we risk overfitting instead of generalizing. The goal is to build a model that not only learns from past data but also provides useful insight on new data.

Data understanding

For this part of the analysis, I use the Framingham Heart Study dataset (@RProjectFramingham) for both the Decision Tree and Random Forest analyses. The target variable is GLUCOSE, the casual serum glucose level in mg/dL.

Code
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Load the Framingham Heart Study data set
data = pd.read_csv("data/frmgham2.csv")
print(data.head())
data.dropna(inplace=True)

# Split target variables and predictor variables
y = data['GLUCOSE']
x = data.drop('GLUCOSE', axis=1)
x = x.iloc[:, 1:]
x = x.values
y = y.values

# Normalize 
x=0.1+(x-np.min(x,axis=0))/(np.max(x,axis=0)-np.min(x,axis=0))
y=0.1+(y-np.min(y,axis=0))/(np.max(y,axis=0)-np.min(y,axis=0))
   RANDID  SEX  TOTCHOL  AGE  SYSBP  DIABP  CURSMOKE  CIGPDAY    BMI  \
0    2448    1    195.0   39  106.0   70.0         0      0.0  26.97   
1    2448    1    209.0   52  121.0   66.0         0      0.0    NaN   
2    6238    2    250.0   46  121.0   81.0         0      0.0  28.73   
3    6238    2    260.0   52  105.0   69.5         0      0.0  29.43   
4    6238    2    237.0   58  108.0   66.0         0      0.0  28.50   

   DIABETES  ...  CVD  HYPERTEN  TIMEAP  TIMEMI  TIMEMIFC  TIMECHD  TIMESTRK  \
0         0  ...    1         0    8766    6438      6438     6438      8766   
1         0  ...    1         0    8766    6438      6438     6438      8766   
2         0  ...    0         0    8766    8766      8766     8766      8766   
3         0  ...    0         0    8766    8766      8766     8766      8766   
4         0  ...    0         0    8766    8766      8766     8766      8766   

   TIMECVD  TIMEDTH  TIMEHYP  
0     6438     8766     8766  
1     6438     8766     8766  
2     8766     8766     8766  
3     8766     8766     8766  
4     8766     8766     8766  

[5 rows x 39 columns]
RuntimeWarning: invalid value encountered in divide
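The warning is consistent with at least one column being constant after `dropna`, so that max equals min and the scaling computes 0/0. That is an assumption, since the notebook does not show which column triggers it. A minimal sketch of a guarded version of the same scaling, with a hypothetical helper name `min_max_scale`:

```python
import numpy as np

def min_max_scale(a, shift=0.1):
    """Column-wise min-max scaling with the same +0.1 shift.

    Constant columns (max == min) are mapped to `shift` instead of NaN.
    """
    a = np.asarray(a, dtype=float)
    lo = a.min(axis=0)
    span = a.max(axis=0) - lo
    safe_span = np.where(span == 0, 1.0, span)  # avoid 0/0 for constant columns
    return shift + (a - lo) / safe_span

# Small example: the second column is constant
demo_x = np.array([[1.0, 3.0],
                   [2.0, 3.0],
                   [4.0, 3.0]])
scaled = min_max_scale(demo_x)
print(scaled)
```

Dropping constant columns before modeling would be another option, since they carry no information for a tree.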

EDA

Code
data_eda = data.iloc[:, 1:]
print(data_eda.describe())
               SEX      TOTCHOL          AGE        SYSBP        DIABP  \
count  2236.000000  2236.000000  2236.000000  2236.000000  2236.000000   
mean      1.565742   237.598837    60.166816   139.027728    80.999106   
std       0.495770    45.288008     8.295838    22.343693    11.330523   
min       1.000000   112.000000    44.000000    86.000000    30.000000   
25%       1.000000   206.000000    53.000000   122.000000    73.000000   
50%       2.000000   235.000000    59.000000   136.000000    80.000000   
75%       2.000000   265.000000    67.000000   153.000000    88.000000   
max       2.000000   625.000000    81.000000   246.000000   130.000000   

          CURSMOKE      CIGPDAY          BMI     DIABETES       BPMEDS  ...  \
count  2236.000000  2236.000000  2236.000000  2236.000000  2236.000000  ...   
mean      0.348837     6.864490    25.752227     0.070662     0.144454  ...   
std       0.476709    11.622125     3.883322     0.256317     0.351629  ...   
min       0.000000     0.000000    14.430000     0.000000     0.000000  ...   
25%       0.000000     0.000000    23.187500     0.000000     0.000000  ...   
50%       0.000000     0.000000    25.380000     0.000000     0.000000  ...   
75%       1.000000    10.000000    27.845000     0.000000     0.000000  ...   
max       1.000000    80.000000    46.520000     1.000000     1.000000  ...   

               CVD     HYPERTEN       TIMEAP       TIMEMI     TIMEMIFC  \
count  2236.000000  2236.000000  2236.000000  2236.000000  2236.000000   
mean      0.233900     0.738372  7731.038909  8074.634615  8030.673971   
std       0.423404     0.439619  2032.270314  1601.230796  1661.646908   
min       0.000000     0.000000     0.000000     0.000000     0.000000   
25%       0.000000     0.000000  7570.750000  8609.500000  8454.000000   
50%       0.000000     1.000000  8766.000000  8766.000000  8766.000000   
75%       0.000000     1.000000  8766.000000  8766.000000  8766.000000   
max       1.000000     1.000000  8766.000000  8766.000000  8766.000000   

           TIMECHD     TIMESTRK      TIMECVD      TIMEDTH      TIMEHYP  
count  2236.000000  2236.000000  2236.000000  2236.000000  2236.000000  
mean   7498.554562  8167.498211  7640.489267  8351.642665  3943.845707  
std    2255.888945  1342.934286  2138.744071   969.038296  3495.722584  
min       0.000000     0.000000     0.000000  4182.000000     0.000000  
25%    6921.250000  8673.250000  7347.500000  8766.000000     0.000000  
50%    8766.000000  8766.000000  8766.000000  8766.000000  2973.500000  
75%    8766.000000  8766.000000  8766.000000  8766.000000  7989.250000  
max    8766.000000  8766.000000  8766.000000  8766.000000  8766.000000  

[8 rows x 38 columns]

Correlation Matrix Heatmap

Code
corr = data.corr()
sns.set_theme(style="white")
f, ax = plt.subplots(figsize=(11, 9))  # Set up the matplotlib figure
cmap = sns.diverging_palette(230, 20, as_cmap=True)     # Generate a custom diverging colormap
# Draw the heatmap with the correct aspect ratio (no mask is applied)
sns.heatmap(corr,  cmap=cmap, vmin=-1, vmax=1, center=0,
        square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show();

Model Tuning

Partition data into train/test split

Code
from sklearn.model_selection import train_test_split
test_ratio=0.2
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_ratio, random_state=0)
y_train=y_train.flatten()
y_test=y_test.flatten()

print("x_train.shape        :",x_train.shape)
print("y_train.shape        :",y_train.shape)

print("X_test.shape     :",x_test.shape)
print("y_test.shape     :",y_test.shape)
x_train.shape        : (1788, 37)
y_train.shape        : (1788,)
X_test.shape     : (448, 37)
y_test.shape     : (448,)

Hyper-Parameter tuning for Decision Tree Regression (max_depth)

Code
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier


# HYPER PARAMETER SEARCH FOR OPTIMAL TREE DEPTH (max_depth)
hyper_param=[]
train_error=[]
test_error=[]

# LOOP OVER HYPER-PARAM
for i in range(1,40):
    # INITIALIZE MODEL 
    model = DecisionTreeRegressor(max_depth=i)

    # TRAIN MODEL 
    model.fit(x_train,y_train)

    # OUTPUT PREDICTIONS FOR TRAINING AND TEST SET 
    yp_train = model.predict(x_train)
    yp_test = model.predict(x_test)

    # shift=1+np.min(y_train) #add shift to remove division by zero 
    err1=mean_absolute_error(y_train, yp_train) 
    err2=mean_absolute_error(y_test, yp_test) 
    
    # err1=100.0*np.mean(np.absolute((yp_train-y_train)/y_train))
    # err2=100.0*np.mean(np.absolute((yp_test-y_test)/y_test))

    hyper_param.append(i)
    train_error.append(err1)
    test_error.append(err2)

    if(i==1 or i%10==0):
        print("hyperparam =",i)
        print(" train error:",err1)
        print(" test error:" ,err2)
hyperparam = 1
 train error: 0.035268787108589356
 test error: 0.03418032867241032
hyperparam = 10
 train error: 0.02038622946162721
 test error: 0.04358912209657024
hyperparam = 20
 train error: 0.005180126792851705
 test error: 0.047937823543169375
hyperparam = 30
 train error: 0.00022663570019033476
 test error: 0.04882001990611059
  • For max_depth = 1, the training and test errors are both low and nearly equal, at around 0.035 (in normalized units). Even a very shallow tree therefore generalizes as well as it fits.

  • As max_depth increases to 10, the training error drops to about 0.020, while the test error rises to about 0.044.

  • As max_depth increases further to 20 and 30, the training error falls close to 0 while the test error keeps rising, to about 0.048 and 0.049. This widening gap is a sign of overfitting: the model fits the training data too closely, which hurts generalization.

  • Overall, the results show that a shallow tree is the better choice here: the test error is lowest at small depths (about 0.034 at max_depth = 1) and is already noticeably higher by max_depth = 10.

  • Convergence plot

Code
plt.plot(hyper_param,train_error ,linewidth=2, color='k')
plt.plot(hyper_param,test_error ,linewidth=2, color='b')

plt.xlabel("Depth of tree (max depth)")
plt.ylabel("Training (black) and test (blue) MAE (error)")

i=1
print(hyper_param[i],train_error[i],test_error[i])
2 0.03442784374001432 0.032813904712789006

  • The convergence plot shows the training and test errors diverging as max_depth increases. The point at which the two lines start to separate is critical: it marks the transition from underfitting, through the optimal complexity, to overfitting. According to the plot, a max_depth of 1 to 5 gives the best balance between training error and test error.
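The MAE values in this section are in normalized units, because the target was rescaled as 0.1 + (y - min) / (max - min). The constant shift cancels in an absolute error, so an error can be converted back to the original units by multiplying by (max - min). A minimal sketch, with a hypothetical helper name and a hypothetical target range (the real min and max are not shown in the notebook):

```python
def mae_in_original_units(mae_scaled, y_min, y_max):
    """Convert an MAE computed on min-max scaled targets back to original units."""
    return mae_scaled * (y_max - y_min)

# Hypothetical target range, chosen only to illustrate the arithmetic
example_mae = mae_in_original_units(0.034, y_min=50.0, y_max=450.0)
print(example_mae)
```

With the real range taken from the data before normalization, the same function gives the typical prediction error in the target's own units.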

Hyper-Parameter tuning for Decision Tree Regression (min_samples_split)

Code
# HYPER PARAMETER SEARCH FOR OPTIMAL min_samples_split
hyper_param=[]
train_error=[]
test_error=[]

# LOOP OVER HYPER-PARAM
for i in range(2,100):
    # INITIALIZE MODEL 
    model = DecisionTreeRegressor(min_samples_split=i)

    # TRAIN MODEL 
    model.fit(x_train,y_train)

    # OUTPUT PREDICTIONS FOR TRAINING AND TEST SET 
    yp_train = model.predict(x_train)
    yp_test = model.predict(x_test)

    # shift=1+np.min(y_train) #add shift to remove division by zero 
    err1=mean_absolute_error(y_train, yp_train) 
    err2=mean_absolute_error(y_test, yp_test) 
    
    # err1=100.0*np.mean(np.absolute((yp_train-y_train)/y_train))
    # err2=100.0*np.mean(np.absolute((yp_test-y_test)/y_test))

    hyper_param.append(i)
    train_error.append(err1)
    test_error.append(err2)

    if(i%10==0):
        print("hyperparam =",i)
        print(" train error:",err1)
        print(" test error:" ,err2)
hyperparam = 10
 train error: 0.014984530706046054
 test error: 0.045871667532753846
hyperparam = 20
 train error: 0.021962415929112074
 test error: 0.043417885579028495
hyperparam = 30
 train error: 0.026579851814848933
 test error: 0.040156857780947484
hyperparam = 40
 train error: 0.0279409667047845
 test error: 0.038751571864489465
hyperparam = 50
 train error: 0.028391822245632067
 test error: 0.038300992283321324
hyperparam = 60
 train error: 0.029240913243283456
 test error: 0.036936685875858556
hyperparam = 70
 train error: 0.029627133194660586
 test error: 0.036694713507271985
hyperparam = 80
 train error: 0.03032461525639734
 test error: 0.03656614931125601
hyperparam = 90
 train error: 0.03058836291359766
 test error: 0.036432623399717076
  • The min_samples_split parameter controls the minimum number of samples required to split an internal node during the construction of the decision tree.

  • As min_samples_split increases from 10 to 20, the training error rises (from about 0.015 to 0.022) while the test error falls (from about 0.046 to 0.043). Requiring more samples before a node can be split produces a simpler tree that overfits less.

  • Beyond min_samples_split = 20, the same pattern continues: the training error climbs gradually to about 0.031 at 90, and the test error falls to about 0.036, so the gap between the two keeps narrowing.

Convergence Plot

Code
plt.plot(hyper_param,train_error ,linewidth=2, color='k')
plt.plot(hyper_param,test_error ,linewidth=2, color='b')

plt.xlabel("Minimum number of points in split (min_samples_split)")
plt.ylabel("Training (black) and test (blue) MAE (error)")
Text(0, 0.5, 'Training (black) and test (blue) MAE (error)')

The convergence plot shows the training and test errors moving toward each other as min_samples_split increases, with most of the change complete by roughly min_samples_split = 50.

Re-train with optimal parameters

Code
# INITIALIZE MODEL 
model = DecisionTreeRegressor(max_depth=1,min_samples_split=50)
model.fit(x_train,y_train)                     # TRAIN MODEL 


# OUTPUT PREDICTIONS FOR TRAINING AND TEST SET 
yp_train = model.predict(x_train)
yp_test = model.predict(x_test)

err1=mean_absolute_error(y_train, yp_train) 
err2=mean_absolute_error(y_test, yp_test) 
    
print(" train error:",err1)
print(" test error:" ,err2)
 train error: 0.03526878710858939
 test error: 0.03418032867241034
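These errors match the max_depth = 1 results from the depth search (about 0.0353 and 0.0342). That is expected: a depth-1 tree makes a single split at the root, which holds all 1,788 training rows, so min_samples_split = 50 never becomes binding. The sketch below checks the same behavior on synthetic data; the data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data, used only for illustration
X_stump = rng.uniform(size=(500, 3))
y_stump = 2.0 * X_stump[:, 0] + rng.normal(scale=0.1, size=500)

# Depth-1 trees with and without the min_samples_split constraint
stump_a = DecisionTreeRegressor(max_depth=1, random_state=0)
stump_b = DecisionTreeRegressor(max_depth=1, min_samples_split=50,
                                random_state=0)
stump_a.fit(X_stump, y_stump)
stump_b.fit(X_stump, y_stump)

same_predictions = np.allclose(stump_a.predict(X_stump),
                               stump_b.predict(X_stump))
print(stump_a.get_n_leaves(), stump_b.get_n_leaves(), same_predictions)
```

If both parameters are meant to shape the model, a somewhat larger max_depth would be needed for min_samples_split to have any effect.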

Parity Plot

  • Plotting y_pred vs y_data lets you see how good the fit is

  • The closer the points are to the line y=x, the better the fit (y_pred = y_data is a perfect fit)

Code
plt.plot(y_train,yp_train ,"o", color='k')
plt.plot(y_test,yp_test ,"o", color='b')
plt.plot(y_train,y_train ,"-", color='r')

plt.xlabel("y_data")
plt.ylabel("y_pred (blue=test)(black=Train)")
Text(0, 0.5, 'y_pred (blue=test)(black=Train)')

Plot Tree

Code
from sklearn import tree
def plot_tree(model):
    fig = plt.figure(figsize=(25,20))
    _ = tree.plot_tree(model, 
                    filled=True)
    plt.show()

plot_tree(model)

Conclusions (the final results are discussed within each section)

  • The decision tree regression gave us insight into glucose levels. The tuned model (max_depth = 1) has nearly identical training and test errors (about 0.035 and 0.034 in normalized units), so it generalizes consistently, whereas deeper trees fit the training data closely but performed worse on unseen data. This suggests that, for this target, limiting the tree’s complexity matters more than fitting the training data closely.

  • To improve, we aim to adjust our model’s complexity. When the model is too simple, we miss important patterns; when it is too complex, we risk overfitting instead of generalizing. The goal is to build a model that not only learns from past data but also provides useful insight on new data.

Source Code
---
title: "Decision Tree"
format:
  html:
    page-layout: full
    code-fold: true
    code-copy: true
    code-tools: true
    code-overflow: wrap
    embed-resources: true
---

```{python}
import sklearn
from sklearn import datasets
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import random
from collections import Counter
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support
```

## Introductions

### Decision Trees:

A Decision Tree is a tree-like structure where each internal node represents a feature, each branch represents a decision rule, and each leaf node represents the tree outcome. The top node in a tree is known as the root node. It learns to partition on the basis of the attribute value. It partitions the tree in a recursive manner called recursive partitioning. This tree-like structure helps in decision making.
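
To make recursive partitioning concrete, the following is a minimal sketch of how a single split could be chosen on one feature by minimizing weighted Gini impurity. The helper names `gini` and `best_split` and the toy `age`/`label` arrays are hypothetical and are not part of this project's data or of scikit-learn's internals; a full tree would apply the same search recursively to each resulting subset.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Return (threshold, weighted impurity) of the best split on one feature."""
    best_t, best_score = None, np.inf
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values
    for t in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy data (hypothetical): an age-like feature and a binary label
age = np.array([35, 42, 48, 51, 58, 63, 67, 70])
label = np.array([0, 0, 0, 0, 1, 1, 1, 1])

threshold, impurity = best_split(age, label)
print("threshold:", threshold, "weighted Gini:", impurity)
```

In this toy example the two classes are perfectly separable, so the chosen threshold yields two pure child nodes.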

### Random Forest:

Random Forest is an ensemble learning method, where multiple decision trees are combined to increase the overall performance. While a single decision tree might lead to overfitting, Random Forest averages multiple trees to reduce overfitting and improve prediction accuracy. Furthermore, when splitting each node during the construction of a tree, the best split is found either from all input features or a random subset of size max_features.
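
As a rough illustration of the ensemble idea, the sketch below fits one decision tree and one random forest on synthetic data and compares their test accuracy. The synthetic dataset, the variable names, and the settings (`n_estimators=200`, `max_features="sqrt"`) are assumptions chosen for illustration; they are not the configuration used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the Framingham dataset
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# A single, fully grown tree
single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# A forest: each split considers a random subset of sqrt(n_features) features
forest = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", random_state=0
).fit(X_tr, y_tr)

tree_acc = single_tree.score(X_te, y_te)
forest_acc = forest.score(X_te, y_te)
print("single tree test accuracy:", tree_acc)
print("random forest test accuracy:", forest_acc)
```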

## Methods

For this part of the analysis, I apply decision trees to both `classification` and `regression`: classification methods for the binary variable cardiovascular disease `(CVD)`, and regression methods for the continuous numeric variable `GLUCOSE`.

::: panel-tabset
# Classification

## Data and Class Distribution

For this part of the analysis, I will be using the Framingham Heart Study dataset\@RProjectFramingham for both Decision Tree and Random Forest analysis. The target variable is `"CVD"`, which indicates cardiovascular disease.

```{python}
# Load the Framingham Heart Study data set
data = pd.read_csv("data/frmgham2.csv")
data.head()

# Split target variables and predictor variables
y = data['CVD']
x = data.drop('CVD', axis=1)
x = x.iloc[:, 1:]
x = x.values
y = y.values


# Calculate the class distribution (in percent) of the target 'CVD'
cd = data['CVD'].value_counts(normalize=True) * 100
print(cd)
```

The results show a binary target: roughly `75%` of the observations are 0 (negative for CVD) and about `25%` are 1 (positive for CVD). Because the negative class is much larger, a model could become biased toward the majority class, so the evaluation needs to look beyond plain accuracy to address this issue.
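
One way to see why accuracy alone can mislead on imbalanced classes is to score a classifier that always predicts the majority class. The sketch below uses synthetic labels with about 25% positives as a stand-in for `CVD` (an assumption, not the real data) and compares plain accuracy with balanced accuracy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic labels with roughly 25% positives (assumption, not the real CVD column)
rng = np.random.default_rng(0)
y_demo = (rng.random(2000) < 0.25).astype(int)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Always predict the most frequent class
majority = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred = majority.predict(X_demo)

acc = accuracy_score(y_demo, pred)
bal_acc = balanced_accuracy_score(y_demo, pred)  # mean of per-class recall
print("accuracy:", acc)
print("balanced accuracy:", bal_acc)
```

The majority-class model reaches an accuracy close to the share of negatives while never identifying a single positive case, which is why per-class metrics are worth reporting alongside accuracy.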

## EDA

```{python}
data_eda = data.iloc[:, 1:]
print(data_eda.describe())
```

### Correlation Matrix Heatmap

```{python}
corr = data.corr()
sns.set_theme(style="white")
f, ax = plt.subplots(figsize=(11, 9))  # Set up the matplotlib figure
cmap = sns.diverging_palette(230, 20, as_cmap=True)     # Generate a custom diverging colormap
# Draw the heatmap on a fixed [-1, 1] color scale with square cells
sns.heatmap(corr,  cmap=cmap, vmin=-1, vmax=1, center=0,
        square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show();
```

## Baseline model for comparison

```{python}
## RANDOM CLASSIFIER 
def random_classifier(y_data):
    ypred=[];
    max_label=np.max(y_data); #print(max_label)
    for i in range(0,len(y_data)):
        ypred.append(int(np.floor((max_label+1)*np.random.uniform(0,1))))

    print("-----RANDOM CLASSIFIER-----")
    print("count of prediction:",Counter(ypred).values()) # counts the elements' frequency
    print("probability of prediction:",np.fromiter(Counter(ypred).values(), dtype=float)/len(y_data)) # counts the elements' frequency
    print("accuracy",accuracy_score(y_data, ypred))
    print("precision, recall, fscore:",precision_recall_fscore_support(y_data, ypred))

np.random.seed(42)    
random.seed(42)
random_classifier(y)
```

-   **Interpretation**

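**Count of Prediction:** It shows the number of times each class label (0 and 1) was predicted. The values \[5776, 5851\] are the counts for the two predicted labels, so the predictions are split almost evenly between class '0' and class '1'. Note that `Counter(...).values()` lists the counts in order of first appearance rather than sorted by label.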

**Probability of Prediction:** This represents the proportion of each class label in the predictions. The values `[0.50520341 0.49479659]` suggest that approximately `50%` of predictions are class '0', which is negative, and `50%` are class '1', which is positive. This near-even split indicates that the classifier is randomly assigning class labels with almost equal probability to each class.

**Accuracy:** An accuracy of approximately `50%` is observed. For a binary classifier, this is what random guessing with equal probabilities yields, as expected given the nature of the random classifier.

**Precision:** Precision for each class (`0.7495744` for class '0', `0.24821832` for class '1') is the proportion of predictions of that class that are correct. Because the labels are assigned at random, precision simply mirrors the class frequencies (about 75% and 25%); the higher value for class '0' reflects the class imbalance, not a more reliable classifier.

**Recall:** Recall (`0.49702108` for class '0', `0.50446838` for class '1') is the proportion of actual members of each class that are correctly identified. Values close to `50%` are expected, since each instance is assigned to either class with equal probability.

**F-Score:** F-Score is the harmonic mean of precision and recall. The values (`0.60306807` for class '0', `0.33009709` for class '1') indicate the balance between precision and recall for each class, with class '0' having a better balance compared to class '1'.
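
The definitions above can be checked by hand on a small example. The sketch below uses made-up labels (25% positives, predictions split evenly, loosely mimicking the random classifier) and computes precision, recall, and F-score for class '1' directly from the true-positive, false-positive, and false-negative counts, then compares them with `precision_recall_fscore_support`.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Made-up labels for illustration: 6 negatives, 2 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# Counts for class 1
tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted 1, actually 1
fp = np.sum((y_pred == 1) & (y_true == 0))  # predicted 1, actually 0
fn = np.sum((y_pred == 0) & (y_true == 1))  # predicted 0, actually 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print("class 1 by hand:", precision, recall, f1)

# Same metrics from scikit-learn, one entry per class (index 0 and 1)
p, r, f, support = precision_recall_fscore_support(y_true, y_pred)
print("scikit-learn:", p, r, f, support)
```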

## Model Tuning

### Partition data into train/test split

```{python}
from sklearn.model_selection import train_test_split
test_ratio=0.2
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_ratio, random_state=0)
y_train=y_train.flatten()
y_test=y_test.flatten()

print("x_train.shape        :",x_train.shape)
print("y_train.shape        :",y_train.shape)

print("X_test.shape     :",x_test.shape)
print("y_test.shape     :",y_test.shape)
```

### Hyper-Parameter tuning for Decision Tree Classification (max_depth)

```{python}
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier


# HYPER PARAMETER SEARCH FOR OPTIMAL TREE DEPTH (max_depth)
hyper_param=[]
train_error=[]
test_error=[]

# LOOP OVER HYPER-PARAM
for i in range(1,40):
    # INITIALIZE MODEL 
    model = DecisionTreeClassifier(max_depth=i)

    # TRAIN MODEL 
    model.fit(x_train,y_train)

    # OUTPUT PREDICTIONS FOR TRAINING AND TEST SET 
    yp_train = model.predict(x_train)
    yp_test = model.predict(x_test)

    # shift=1+np.min(y_train) #add shift to remove division by zero 
    err1=mean_absolute_error(y_train, yp_train) 
    err2=mean_absolute_error(y_test, yp_test) 
    
    # err1=100.0*np.mean(np.absolute((yp_train-y_train)/y_train))
    # err2=100.0*np.mean(np.absolute((yp_test-y_test)/y_test))

    hyper_param.append(i)
    train_error.append(err1)
    test_error.append(err2)

    if(i==1 or i%10==0):
        print("hyperparam =",i)
        print(" train error:",err1)
        print(" test error:" ,err2)

```

-   For `max_depth` = 1, both the training and test errors are relatively high, with values around 0.15. This suggests that the decision tree is underfitting the data; the tree needs to be deeper to fit the data better. (For 0/1 labels, the MAE used here equals the misclassification rate.)

-   As `max_depth` increases to 10 and beyond, both the training and test errors decrease significantly. At `max_depth` = 10, the training error is very low (close to 0), indicating that the model fits the training data extremely well, while the test error also remains relatively low, suggesting that the model generalizes reasonably well to unseen data.

-   As `max_depth` increases further to 20 and 30, the training error remains very low (close to 0), but the test error starts to increase slightly. This is a sign of overfitting: the model is memorizing the training data, including its noise, which hurts generalization.

-   Overall, the results show that a `max_depth` of around 10 appears to be a good choice because it achieves low test error without overfitting the training data.

-   **Convergence plot**

```{python}
plt.plot(hyper_param,train_error ,linewidth=2, color='k')
plt.plot(hyper_param,test_error ,linewidth=2, color='b')

plt.xlabel("Depth of tree (max depth)")
plt.ylabel("Training (black) and test (blue) MAE (error)")

i=1
print(hyper_param[i],train_error[i],test_error[i])

```

-   The convergence plot shows that as `max_depth` increases, both training and test errors stabilize, which agrees with the observations in the previous section. Beyond a certain depth, the tree no longer substantially improves its fit to the training data, and increasing `max_depth` further risks overfitting: the model becomes overly complex and starts fitting noise in the training data without improving its generalization. The plot therefore supports the notion that there is an optimal `max_depth` that balances model complexity against performance on the test dataset.
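
Reading the optimal depth off a single train/test split can be sensitive to how the split happened to fall. As a hedged alternative, the sketch below uses scikit-learn's `validation_curve` with 5-fold cross-validation to score each depth. The synthetic data and the depth range are assumptions for illustration only, and this is not the procedure used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the Framingham dataset
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, random_state=0
)

depths = np.arange(1, 16)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X_demo,
    y_demo,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

# Average over the folds; rows correspond to depths, columns to folds
mean_val = val_scores.mean(axis=1)
best_depth = int(depths[np.argmax(mean_val)])
print("depth with best mean validation accuracy:", best_depth)
```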

### Hyper-Parameter tuning for Decision Tree Classification (min_samples_split)

```{python}
# HYPER PARAMETER SEARCH FOR OPTIMAL min_samples_split
hyper_param=[]
train_error=[]
test_error=[]

# LOOP OVER HYPER-PARAM
for i in range(2,100):
    # INITIALIZE MODEL 
    model = DecisionTreeClassifier(min_samples_split=i)

    # TRAIN MODEL 
    model.fit(x_train,y_train)

    # OUTPUT PREDICTIONS FOR TRAINING AND TEST SET 
    yp_train = model.predict(x_train)
    yp_test = model.predict(x_test)

    # shift=1+np.min(y_train) #add shift to remove division by zero 
    err1=mean_absolute_error(y_train, yp_train) 
    err2=mean_absolute_error(y_test, yp_test) 
    
    # err1=100.0*np.mean(np.absolute((yp_train-y_train)/y_train))
    # err2=100.0*np.mean(np.absolute((yp_test-y_test)/y_test))

    hyper_param.append(i)
    train_error.append(err1)
    test_error.append(err2)

    if(i%10==0):
        print("hyperparam =",i)
        print(" train error:",err1)
        print(" test error:" ,err2)

```

-   The `min_samples_split` parameter controls the minimum number of samples required to split an internal node during the construction of the decision tree.

-   As `min_samples_split` increases from 10 to 20, both the training and test errors generally decrease. This suggests that requiring more samples before an internal node may split helps the model here, plausibly because it prevents splits on small, noisy groups of samples.

-   Beyond `min_samples_split` = 20, the training error remains low or increases slightly, indicating that the model still fits the training data well. The test error, however, follows a clear wave-like pattern.

-   Overall, the results suggest that a `min_samples_split` value around 30 provides a good balance between model complexity and generalization.

**Convergence Plot**

```{python}
plt.plot(hyper_param,train_error ,linewidth=2, color='k')
plt.plot(hyper_param,test_error ,linewidth=2, color='b')

plt.xlabel("Minimum number of points in split (min_samples_split)")
plt.ylabel("Training (black) and test (blue) MAE (error)")
```

The convergence plot highlights a critical observation: there exists a specific range of `min_samples_split` values where the model performs well on both the training and test datasets. However, as `min_samples_split` goes beyond this range, the model's performance on the test data deteriorates noticeably.

In practical terms, this indicates that we should seek a `min_samples_split` value that does not result in a significant drop in test data performance. The objective is to find the sweet spot where the model maintains good generalization to unseen data without overfitting or underfitting.
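
Since `max_depth` and `min_samples_split` interact, one option is to search over both at once instead of tuning them one at a time. The sketch below does this with `GridSearchCV` and 5-fold cross-validation. The synthetic data and the grid values are assumptions for illustration, not the values selected in this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the Framingham dataset
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, random_state=0
)

# Hypothetical grid covering both hyperparameters
param_grid = {
    "max_depth": [2, 5, 10, 20],
    "min_samples_split": [2, 10, 30, 50],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best mean CV accuracy:", search.best_score_)
```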

### Re-train with optimal parameters

```{python}
# INITIALIZE MODEL 
model = DecisionTreeClassifier(max_depth=10, min_samples_split=30)
model.fit(x_train,y_train)                     # TRAIN MODEL 


# OUTPUT PREDICTIONS FOR TRAINING AND TEST SET 
yp_train = model.predict(x_train)
yp_test = model.predict(x_test)

err1=mean_absolute_error(y_train, yp_train) 
err2=mean_absolute_error(y_test, yp_test) 
    
print(" train error:",err1)
print(" test error:" ,err2)
```

### Parity Plot

-   Plotting y_pred against y_data shows how good the fit is.

-   The closer the points lie to the line y=x, the better the fit (y_pred = y_data –\> perfect fit).

```{python}
plt.plot(y_train,yp_train ,"o", color='k')
plt.plot(y_test,yp_test ,"o", color='b')
plt.plot(y_train,y_train ,"-", color='r')

plt.xlabel("y_data")
plt.ylabel("y_pred (blue=test)(black=Train)")

```

-   **Training Data Fit:** The black dots are closer to the red line, indicating a good fit for the training data.

-   **Test Data Fit:** The blue dots are spread out, indicating more variance in the fit for the test data compared to the training data.

-   **Overall Performance:** Since some blue points are far from the red line, it suggests the model may not generalize well to new, unseen data, possibly overfitting to the training data.
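
For a binary target, the parity plot can only show points at the four corners (0,0), (0,1), (1,0), and (1,1), so a confusion matrix may summarize the same information more directly. The sketch below uses made-up labels, not the model's actual predictions, to show how the four counts are obtained.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 0, 0, 1, 1])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("true negatives:", tn, "false positives:", fp)
print("false negatives:", fn, "true positives:", tp)
```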

### Plot Tree

```{python}
from sklearn import tree
def plot_tree(model):
    fig = plt.figure(figsize=(25,20))
    _ = tree.plot_tree(model, 
                    filled=True)
    plt.show()

plot_tree(model)
```

## Conclusions (the final results are discussed within each section)

-   The decision tree classification analysis gave us insights into heart disease risks. The model performed quite well on the data it was trained on. However, when faced with new, unseen data, its predictions were inconsistent. This suggests that while the model has learned well from past data, it needs to be better tuned to handle new information effectively.

-   To improve, we aim to adjust the model's complexity. When the model is too simple, we miss important patterns; when it is too complex, we risk overfitting instead of generalizing. The goal is to build a model that not only learns from the past but also provides useful insight on new data.

# Regression

## Data understanding

For this part of the analysis, I will be using the Framingham Heart Study dataset\@RProjectFramingham for both Decision Tree and Random Forest analysis. The target variable is `GLUCOSE`, casual serum glucose, which the Framingham teaching dataset records in mg/dL (milligrams per deciliter).

```{python}
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Load the Framingham Heart Study data set
data = pd.read_csv("data/frmgham2.csv")
print(data.head())
data.dropna(inplace=True)

# Split target variables and predictor variables
y = data['GLUCOSE']
x = data.drop('GLUCOSE', axis=1)
x = x.iloc[:, 1:]
x = x.values
y = y.values

# Min-max normalize features and target to the range [0.1, 1.1]
x=0.1+(x-np.min(x,axis=0))/(np.max(x,axis=0)-np.min(x,axis=0))
y=0.1+(y-np.min(y,axis=0))/(np.max(y,axis=0)-np.min(y,axis=0))
```
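
The normalization above computes the minimum and maximum over all rows before the train/test split, so the test rows influence the scaling. A minimal sketch of an alternative, assuming scikit-learn's `MinMaxScaler` and synthetic stand-in features, fits the scaling on the training rows only and then applies it to both sets. Note that tree splits do not depend on monotone rescaling of the features, so this mainly matters for the target scale and for other models.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in features (assumption), not the Framingham dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=0)

# Fit min and max on the training rows only, mapping them to [0.1, 1.1]
scaler = MinMaxScaler(feature_range=(0.1, 1.1)).fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)  # test values may fall slightly outside the range

print("train range:", X_tr_s.min(), X_tr_s.max())
print("test range:", X_te_s.min(), X_te_s.max())
```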

## EDA

```{python}
data_eda = data.iloc[:, 1:]
print(data_eda.describe())
```

### Correlation Matrix Heatmap

```{python}
corr = data.corr()
sns.set_theme(style="white")
f, ax = plt.subplots(figsize=(11, 9))  # Set up the matplotlib figure
cmap = sns.diverging_palette(230, 20, as_cmap=True)     # Generate a custom diverging colormap
# Draw the heatmap on a fixed [-1, 1] color scale with square cells
sns.heatmap(corr,  cmap=cmap, vmin=-1, vmax=1, center=0,
        square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show();
```

## Model Tuning

### Partition data into train/test split

```{python}
from sklearn.model_selection import train_test_split
test_ratio=0.2
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_ratio, random_state=0)
y_train=y_train.flatten()
y_test=y_test.flatten()

print("x_train.shape        :",x_train.shape)
print("y_train.shape        :",y_train.shape)

print("X_test.shape     :",x_test.shape)
print("y_test.shape     :",y_test.shape)
```

### Hyper-Parameter tuning for Decision Tree Regression (max_depth)

```{python}
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier


# HYPER PARAMETER SEARCH FOR OPTIMAL TREE DEPTH (max_depth)
hyper_param=[]
train_error=[]
test_error=[]

# LOOP OVER HYPER-PARAM
for i in range(1,40):
    # INITIALIZE MODEL 
    model = DecisionTreeRegressor(max_depth=i)

    # TRAIN MODEL 
    model.fit(x_train,y_train)

    # OUTPUT PREDICTIONS FOR TRAINING AND TEST SET 
    yp_train = model.predict(x_train)
    yp_test = model.predict(x_test)

    # shift=1+np.min(y_train) #add shift to remove division by zero 
    err1=mean_absolute_error(y_train, yp_train) 
    err2=mean_absolute_error(y_test, yp_test) 
    
    # err1=100.0*np.mean(np.absolute((yp_train-y_train)/y_train))
    # err2=100.0*np.mean(np.absolute((yp_test-y_test)/y_test))

    hyper_param.append(i)
    train_error.append(err1)
    test_error.append(err2)

    if(i==1 or i%10==0):
        print("hyperparam =",i)
        print(" train error:",err1)
        print(" test error:" ,err2)

```

-   For `max_depth` = 1, both the training and test errors are relatively low, with values around 0.03. This suggests that the decision tree is fitting the data well.

-   As `max_depth` increases to 10 and beyond, the training error starts to drop, whereas the test error starts to go up.

-   As `max_depth` increases further to 20 and 30, the training error remains very low (close to 0), but the test error keeps increasing slightly. This is a sign of overfitting: the model is memorizing the training data, including its noise, which hurts generalization.

-   Overall, the results show that a `max_depth` of around 1 to 10 appears to be a good choice because it achieves low test error without overfitting the training data.

-   **Convergence plot**

```{python}
plt.plot(hyper_param,train_error ,linewidth=2, color='k')
plt.plot(hyper_param,test_error ,linewidth=2, color='b')

plt.xlabel("Depth of tree (max depth)")
plt.ylabel("Training (black) and test (blue) MAE (error)")

i=1
print(hyper_param[i],train_error[i],test_error[i])

```

-   The convergence plot shows how the training and test errors evolve as `max_depth` increases. The point at which the two lines start to diverge significantly is critical: it marks the transition from underfitting, through the optimal complexity, to overfitting. According to the plot, a `max_depth` of 1 to 5 gives the best balance between training error and test error.

### Hyper-Parameter tuning for Decision Tree Regression (min_samples_split)

```{python}
# HYPER PARAMETER SEARCH FOR OPTIMAL min_samples_split
hyper_param=[]
train_error=[]
test_error=[]

# LOOP OVER HYPER-PARAM
for i in range(2,100):
    # INITIALIZE MODEL 
    model = DecisionTreeRegressor(min_samples_split=i)

    # TRAIN MODEL 
    model.fit(x_train,y_train)

    # OUTPUT PREDICTIONS FOR TRAINING AND TEST SET 
    yp_train = model.predict(x_train)
    yp_test = model.predict(x_test)

    # shift=1+np.min(y_train) #add shift to remove division by zero 
    err1=mean_absolute_error(y_train, yp_train) 
    err2=mean_absolute_error(y_test, yp_test) 
    
    # err1=100.0*np.mean(np.absolute((yp_train-y_train)/y_train))
    # err2=100.0*np.mean(np.absolute((yp_test-y_test)/y_test))

    hyper_param.append(i)
    train_error.append(err1)
    test_error.append(err2)

    if(i%10==0):
        print("hyperparam =",i)
        print(" train error:",err1)
        print(" test error:" ,err2)

```

-   The `min_samples_split` parameter controls the minimum number of samples required to split an internal node during the construction of the decision tree.

-   As `min_samples_split` increases from 10 to 20, both the training and test errors generally decrease. This suggests that requiring more samples before an internal node may split helps the model here, plausibly because it prevents splits on small, noisy groups of samples.

-   Beyond `min_samples_split` = 20, the training error remains low, indicating that the model still fits the training data well, while the test error continues to drop.

**Convergence Plot**

```{python}
plt.plot(hyper_param,train_error ,linewidth=2, color='k')
plt.plot(hyper_param,test_error ,linewidth=2, color='b')

plt.xlabel("Minimum number of points in split (min_samples_split)")
plt.ylabel("Training (black) and test (blue) MAE (error)")
```

The convergence plot shows the test and training errors converging at roughly `min_samples_split = 50`.

### Re-train with optimal parameters

```{python}
# INITIALIZE MODEL 
model = DecisionTreeRegressor(max_depth=1,min_samples_split=50)
model.fit(x_train,y_train)                     # TRAIN MODEL 


# OUTPUT PREDICTIONS FOR TRAINING AND TEST SET 
yp_train = model.predict(x_train)
yp_test = model.predict(x_test)

err1=mean_absolute_error(y_train, yp_train) 
err2=mean_absolute_error(y_test, yp_test) 
    
print(" train error:",err1)
print(" test error:" ,err2)
```
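
An MAE on a normalized target is easier to judge next to a trivial reference. The sketch below compares a depth-1 regression tree with a baseline that always predicts the training median, using scikit-learn's `DummyRegressor`. The synthetic data and variable names are assumptions for illustration; the errors it prints say nothing about the Framingham results.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): target depends weakly on one feature
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(500, 4))
y_demo = 0.3 * X_demo[:, 0] + rng.normal(scale=0.05, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Baseline: always predict the median of the training target
baseline = DummyRegressor(strategy="median").fit(X_tr, y_tr)

# Depth-1 tree with the same hyperparameters as the re-trained model above
stump = DecisionTreeRegressor(
    max_depth=1, min_samples_split=50, random_state=0
).fit(X_tr, y_tr)

mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
mae_stump = mean_absolute_error(y_te, stump.predict(X_te))
print("baseline test MAE:", mae_baseline)
print("depth-1 tree test MAE:", mae_stump)
```

If the tree's error is close to the baseline's, the low MAE mostly reflects a narrow target distribution rather than predictive skill.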

### Parity Plot

-   Plotting y_pred against y_data shows how good the fit is.

-   The closer the points lie to the line y=x, the better the fit (y_pred = y_data –\> perfect fit).

```{python}
plt.plot(y_train,yp_train ,"o", color='k')
plt.plot(y_test,yp_test ,"o", color='b')
plt.plot(y_train,y_train ,"-", color='r')

plt.xlabel("y_data")
plt.ylabel("y_pred (blue=test)(black=Train)")

```

### Plot Tree

```{python}
from sklearn import tree
def plot_tree(model):
    fig = plt.figure(figsize=(25,20))
    _ = tree.plot_tree(model, 
                    filled=True)
    plt.show()

plot_tree(model)
```

## Conclusions (the final results are discussed within each section)

-   The decision tree regression analysis gave us insights into heart disease risks. The model performed quite well on the data it was trained on. However, when faced with new, unseen data, its predictions were inconsistent. This suggests that while the model has learned well from past data, it needs to be better tuned to handle new information effectively.

-   To improve, we aim to adjust the model's complexity. When the model is too simple, we miss important patterns; when it is too complex, we risk overfitting instead of generalizing. The goal is to build a model that not only learns from the past but also provides useful insight on new data.
:::